GenitivDB ― a Corpus-Generated Database for German Genitive Classification
Abstract
We present a novel NLP resource for the explanation of linguistic phenomena, built and evaluated by exploring very large annotated language corpora. For the compilation, we use the German Reference Corpus (DeReKo) with more than 5 billion word forms, which is the largest linguistic resource worldwide for the study of contemporary written German. The result is a comprehensive database of German genitive formations, enriched with a broad range of intra- and extralinguistic metadata. It can be used for the notoriously controversial classification and prediction of genitive endings (short endings, long endings, zero marker). We also evaluate the main factors influencing the use of specific endings. To get a general idea of a factor's influence and its side effects, we calculate chi-square tests and visualize the residuals with an association plot. The results are evaluated against a gold standard by implementing tree-based machine learning algorithms. For the statistical analysis, we applied the supervised Logistic Model Trees (LMT) algorithm, using the WEKA software. We intend to use this gold standard to evaluate GenitivDB, as well as to explore methodologies for a predictive genitive model.
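The chi-square step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `chi_square_with_residuals`, the factor shown, and all counts are hypothetical and chosen only for illustration. The sketch computes the Pearson chi-square statistic for a contingency table of a factor against the three ending types, together with the Pearson residuals that an association plot would display, and assumes that every row and column total is non-zero.

```python
from typing import List, Tuple


def chi_square_with_residuals(
    table: List[List[float]],
) -> Tuple[float, int, List[List[float]]]:
    """Return (chi-square statistic, degrees of freedom, Pearson residuals).

    Assumes a rectangular table whose row and column totals are all non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    residuals: List[List[float]] = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of factor and ending type.
            expected = row_totals[i] * col_totals[j] / grand_total
            # Pearson residual: signed, standardized deviation from expectation.
            residual = (observed - expected) / expected ** 0.5
            statistic += residual * residual
            residual_row.append(residual)
        residuals.append(residual_row)

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, residuals


# Hypothetical counts, NOT taken from GenitivDB.
# Rows: two levels of an invented factor (e.g. monosyllabic vs. polysyllabic noun).
# Columns: short ending (-s), long ending (-es), zero marker.
counts = [
    [30, 60, 10],
    [70, 20, 10],
]

statistic, dof, residuals = chi_square_with_residuals(counts)
print(f"chi-square = {statistic:.2f}, df = {dof}")
for row in residuals:
    print([round(r, 2) for r in row])
```

Cells with large positive residuals are over-represented relative to independence, and cells with large negative residuals are under-represented; this is the information an association plot encodes as bar height and direction.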
Similar resources
Syntactic Analyses for Parallel Grammars: Auxiliaries and Genitive NPs
This paper focuses on two disparate aspects of German syntax from the perspective of parallel grammar development. As part of a cooperative project, we present an innovative approach to auxiliaries and multiple genitive NPs in German. The LFG-based implementation presented here avoids unnecessary structural complexity in the representation of auxiliaries by challenging the traditional analysis of...
An Unsupervised System for Identifying English Inclusions in German Text
We present an unsupervised system that exploits linguistic knowledge resources, namely English and German lexical databases and the World Wide Web, to identify English inclusions in German text. We describe experiments with this system and the corpus which was developed for this task. We report the classification results of our system and compare them to the performance of a trained machine lea...
Generating data as a proxy for unavailable corpus data: the contextualized sentence completion task
There is much interest in using large corpora to explore predictors of the probability of higher level linguistic structures, but suitable corpora are not available for all languages and their varieties. We explore a task that uses discourse contexts from an existing corpus as prompts for sentence completion to investigate the usefulness of the method for generating data as a proxy for unavaila...
An XML-based Tool for Tracking English Inclusions in German Text
The use of lexicons and corpora advances both linguistic research and the performance of current natural language processing (NLP) systems. We present a tool that exploits such resources, specifically English and German lexical databases and the World Wide Web, to recognise English inclusions in German newspaper articles. The output of the tool can assist lexical resource developers in monitoring c...
Markedness and Blocking in German Declensional Paradigms
The loss of regular case endings in modern German has led to highly syncretic noun paradigms that neutralise many of the distinctions retained in more conservative determiner and adjective paradigms. Genitive and dative are, for all intents and purposes, the only cases marked in noun paradigms. Strong nonfeminine nouns have a genitive singular in -s. Strong nouns whose plural ends in a schwa or...
Publication date: 2014